Using clouds for data-intensive computing in proteomics

Author

  • Douglas Baxter
Abstract

Abstract Type: Position

Proteomics is an integral part of systems biology research and has become a largely data-intensive field. Due to advances in proteomics technologies, mass spectrometry data archives have been growing, as has the sequence data generated by next-generation sequencing technologies. However, data mining and the application of new methods to existing data sets are currently limited by the need to move and manage large sets of data. For example, mining a proteomics archive of MS/MS spectra for post-translational modifications is currently done on only small subsets of data, not on the entire archive, which contained over a billion MS spectra occupying over 150 TB as of September 2009. The problem is compounded when trying to co-analyze proteomics archives and nucleotide data from a broad spectrum of sequenced organisms, a capability needed when dealing with environmental samples. Application of the methods is limited by the manual labor required to gather all the data on a single file system. Yet integrated analysis of data from different sources has become far more important for furthering discovery. Cloud computing is expected to be an ideal computing model that can address these limitations and support the massive-scale integration and analysis of proteomics and genomics data in a way that is transparent to the application scientist. In this presentation we will identify the different ways of developing cloud computing frameworks for proteomics applications and the challenges associated with each.
For ease of exposition, we will organize the presentation around two types of core compute operations that are prevalent in proteomics and metaproteomics analysis: i) Database search, where the primary example is peptide identification from MS/MS spectral data; and ii) Large-scale graph analysis, where the primary example is protein family characterization, which typically involves computation of large-scale all-against-all sequence-sequence, profile-profile and sequence-profile comparisons. Both applications are central to the structural and functional characterization of proteins, from those of a single organism to those of more complex microbial communities. These applications have a high potential to benefit from cloud computing because of their large data sizes and broad user base. Yet porting them to the cloud paradigm poses several challenges, both algorithmic and systemic. Historically, the codebases for these applications have been serial, and very few implementations support even traditional parallelization (i.e., on distributed and shared memory machines). Therefore, algorithmic innovations are required to map the underlying problem space to a Map-Reduce model, which is becoming the de facto standard for cloud computing. Furthermore, the continued evolution and upgrading of cloud computing technologies and infrastructures, at both the architectural and system software levels, is likely to impede transition and wider adoption. To jumpstart discussion, we will present ideas and preliminary findings from our ongoing research in the area. The overarching goal is to identify the merits and current limitations that exist along the path toward implementing the grand vision of building cloud computing infrastructures for proteomics research and applying them toward transformative advancement of scientific discovery.
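As a rough illustration of what mapping database search onto a Map-Reduce model could look like, the following toy sketch is offered under strong assumptions and is not the method described in the abstract. Each spectrum is reduced to an identifier and a precursor mass, the map step emits candidate peptides whose computed mass falls within a tolerance, and the reduce step keeps the closest candidate per spectrum. Real peptide identification scores fragment-ion spectra rather than precursor mass alone; the function names, the tolerance value, and the in-memory shuffle are all hypothetical.

```python
from collections import defaultdict

# Monoisotopic residue masses in daltons for a small subset of amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203,
    "V": 99.06841, "L": 113.08406, "K": 128.09496,
}
WATER = 18.01056  # mass of H2O added to the residue sum of a peptide


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER


def map_spectrum(spectrum, database, tol=0.02):
    """Map step (hypothetical): emit (spectrum_id, (peptide, mass_error))
    for every database peptide within `tol` Da of the precursor mass."""
    spec_id, precursor_mass = spectrum
    for peptide in database:
        error = abs(peptide_mass(peptide) - precursor_mass)
        if error <= tol:
            yield spec_id, (peptide, error)


def reduce_matches(spec_id, candidates):
    """Reduce step (hypothetical): keep the candidate with the smallest error."""
    best_peptide, _ = min(candidates, key=lambda c: c[1])
    return spec_id, best_peptide


def run(spectra, database):
    """Single-process stand-in for map, shuffle (group by key), and reduce."""
    grouped = defaultdict(list)
    for spectrum in spectra:
        for key, value in map_spectrum(spectrum, database):
            grouped[key].append(value)
    return dict(reduce_matches(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    database = ["GASK", "VLK", "GGSK"]
    spectra = [("s1", 361.196), ("s2", 358.258), ("s3", 500.0)]
    print(run(spectra, database))
```

Because each spectrum is handled independently in the map step, the spectra could in principle be partitioned across many workers. The sketch keeps the whole peptide database in memory and visible to every mapper, which sidesteps the data-movement problem the abstract identifies as the real obstacle at archive scale.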


Similar resources

3D Detection of Power-Transmission Lines in Point Clouds Using Random Forest Method

Inspection of power transmission lines using classic expert-based methods suffers from disadvantages such as high time and money consumption. The advent of UAVs and their application in aerial data gathering help to decrease the time and cost prominently. The purpose of this research is to present an efficient automated method for inspection of power transmission lines based on point c...


An Architecture for Security and Protection of Big Data

The issue of online privacy and security is a challenging subject, as it concerns the privacy of data that are increasingly more accessible via the internet. In other words, people who intend to access the private information of other users can do so more efficiently over the internet. This study is an attempt to address the privacy issue of distributed big data in the context of cloud computin...


Data Replication-Based Scheduling in Cloud Computing Environment

High-performance computing and vast storage are two key factors required for executing data-intensive applications. In comparison with traditional distributed systems like data grids, cloud computing provides these factors in a more affordable, scalable and elastic platform. Furthermore, accessing data files is critical for performing such applications. Sometimes accessing data becomes...


Performance Evaluation of Data Intensive Computing In the Cloud

Big data is a topic of active research in the cloud community. With increasing demand for data storage in the cloud, study of data-intensive applications is becoming a primary focus. Data-intensive applications involve high CPU usage for processing large volumes of data on the scale of terabytes or petabytes. While some research exists for the performance effect of data intensive applications i...


Multi-dimensional Resource Allocation for Data-intensive Large-scale Cloud Applications

Large-scale applications have emerged as one of the important classes of applications in distributed computing. Today, the economic and technical benefits offered by cloud computing technology have encouraged many users to migrate their applications to the cloud. On the other hand, the variety of existing clouds requires them to decide which providers to choose in order to achieve the expected...




Publication date: 2009